Hans Chalupsky, Information
Sciences Institute,
Student team: NO
KOJAK is an integrated suite of
link discovery tools that support
·
group detection (KOJAK Group Finder)
·
anomaly detection (KOJAK UNICORN)
·
pattern matching (KOJAK Pattern Finder)
·
relationship and graph simplification (KOJAK
Simplifier)
KOJAK generally operates on data
represented by labeled graphs where nodes represent typed entities such as
persons, organizations, events, etc. and links represent different kinds of
relationships between these entities (such graphs are also referred to as
semantic graphs or attributed relational graphs). Graphs might be represented
explicitly, or implicitly as views over relational data (e.g., stored in an
RDBMS). Optionally, graphs can be enhanced with background ontologies
and logic-based inferencing supported by the PowerLoom
KR&R system.
KOJAK is not a visualization tool
per se, but primarily a data analysis and knowledge discovery tool. Despite the different emphasis, its
components can support visual analytics tasks as demonstrated by this submission. In particular, the KOJAK Simplifier graph
simplification component supports the reduction and abstraction of large data
graphs for purposes of visualization which is what we are primarily exploiting
here. KOJAK has a simple GUI and
visualization component which we used during the initial experimentation phase. Later on, we switched to Pajek
to generate more sophisticated visualizations for the graphs generated by
KOJAK.
KOJAK has been developed over the
course of seven years primarily with DoD
funding. Chief architect and developer
of the system is Hans Chalupsky. Large portions of code, algorithms and good
ideas have been contributed by Jafar Adibi, Shou-de Lin, Eric Melz, Tom Russ and Andre Valente. More information and papers about KOJAK can be
found here.
Two Page
Summary: YES
ANSWERS:
Phone-1:
What is the Catalano/Vidro social network, as
reflected in the cell phone call data, at the end of the time period
Phone-2: Characterize the changes in the Catalano/Vidro social structure over the ten day period.
Detailed
Answer:
We worked on this challenge over the course of seven days (about 50-60
hours), probably two thirds of which were spent improving and streamlining the
tools, implementing some new functionality, figuring out how to use Pajek for
visualization, etc. The other third was
spent actually exploring the data and trying to solve the problem.
Our general approach was as follows: since our main expertise is not in
the area of visualization but in the area of link discovery, we were interested
to see whether a link discovery system such as KOJAK could still usefully
support a more interactive and exploratory visual analytics task. This is in contrast to evaluations in the
past, where KOJAK has been used primarily fully automatically, batch-processing large amounts of data with minimal user
intervention.
In our solution process we used KOJAK tools such as its UNICORN anomaly
detector and its Simplifier graph simplification component to generate leads
and data products that can then be visualized with off-the-shelf network
visualization components such as KOJAK's own simple visualizer, and later on the Pajek
package which supports more sophisticated features. Looking at these visualizations we formed
hypotheses which we then tried to support or refute using, for example, PowerLoom to query the data in many different ways, or the
KOJAK Group Finder to provide an alternative way to form the final social
network.
We started by translating the data into a couple of different formats
supported by KOJAK: one is a CSV format that can easily be loaded into a KOJAK
graph object, and the other was a logic format as supported by our PowerLoom KR&R system.
Given the relatively small size of the data, it was not necessary to
host the data on a relational database (also supported).
Our first attempt was to see whether we could use the association
strength computations underlying the KOJAK Group Finder tool (based on a mutual
information model - see [1]) to see whether there are any unusually strong
connections between any pairs of connected nodes. The Group Finder needs a set of seed nodes to
create a group and was therefore not directly applicable at this stage (it came
in useful at the end of the process).
Unfortunately, the strength measures were not very informative by
themselves.
Our next attempt was to use the KOJAK UNICORN tool [2] to find anomalous
nodes in the connection graph defined by the data. UNICORN operates on semantic graphs and works
best if there are different types of relations between nodes represented by
different edge labels. It combines these
labels in all possible ways and then computes statistical correlations between
a particular sequence of labels and a node.
These correlation values are then combined into feature vectors called
"semantic profiles" that characterize the graph neighborhood of a
node to a certain depth. Once we have a
set of semantic profiles (one for each node), we can use standard outlier
detection to find nodes with abnormal profiles (see [2] for more details). Given that this data only had "phoneCall" links, we were somewhat skeptical at this
point whether UNICORN could be usefully applied here.
Here is the result of our first attempt to run UNICORN on the connection
graph between callers, ignoring call direction and call frequency. This also shows how KOJAK is used via a
command interface that uses a Lisp-based syntax to perform operations such as
loading data and running analysis tools (it is not a Lisp program, however):
|= (find-abnormal-nodes :graph (clv graph)
:ignore-edge-direction?
true
:min-path-length
1 :max-path-length 4
:top-n
20 :k 1 :strong-outliers? true)
((<NODE 0> 0.991667)
(<NODE 306> 0.975)
(<NODE 1> 0.9)
(<NODE 5> 0.891667)
(<NODE 309> 0.883333)
(<NODE 23> 0.733333)
(<NODE 14> 0.708334)
(<NODE 2> 0.641667)
(<NODE 3> 0.633333)
(<NODE 8> 0.616667)
(<NODE 107> 0.616667)
(<NODE 19> 0.583334)
(<NODE 41> 0.583334)
(<NODE 200> 0.566667)
(<NODE 13> 0.533333)
(<NODE 52> 0.458333)
(<NODE 145> 0.416667)
(<NODE 360> 0.333333)
(<NODE 21> 0.325)
(<NODE 11> 0.3166665))
These are the top 20 most abnormal nodes among the 400 nodes in the graph
(based on all the data). Given that node
200 mentioned in the scenario description was part of this list made us hopeful
that UNICORN was picking up on something useful in this data. Actually, if our final analysis is correct,
this list already contains most of the major players and their aliases. This is somewhat surprising, given that this
data isn't really that suited for UNICORN, but what it is picking up on here is
unusual connectivity patterns (high/low path frequencies) and their
combinations which seems to give us some interesting starting points. To give UNICORN more label combinations to
work with, we mapped call frequencies between parties onto symbolic high/low
and later high/medium/low ranges which seemed to work well.
Our next step was to use the KOJAK Simplifier tool. The Simplifier first computes a set of global
abnormal nodes using UNICORN and then a set of locally abnormal nodes connected
to the global seeds. It then fills in
the graph between the global and local nodes given some user-specified
parameters. The result is a simplified
graph that contains important nodes with relevant connections between
them. The hope is that such a simplified
graph shows important elements and structure of the data. The following command produced the graph
shown in Figure 1 which was one of the first visualizations generated during
our investigation:
|= (simplify-graph :graph (clv graph)
:ignore-edge-direction?
true
:min-path-length
1 :max-path-length 4
:top-global-N
10 :top-local-N 5 :k 1
:max-connect-length
2
:strong-outliers?
true)
The parameters control how many global and local nodes are produced, and
how edges are filled in in the resulting graph. Here we add all paths up to
"max-connect-length" 2 between global and local abnormal nodes. It generally requires some experimentation
with these parameters to generate a graph with the right amount of
information. Ideally, this would be done
interactively with immediate feedback in a GUI, for now, we have to run
multiple times and load the graph into a visualization tool to see the result.
Figure 1: Initial simplification experiment using KOJAK's native visualization package
The layout in Figure 1 was generated with KOJAK's
own visualization tool that uses a simple spring-based layout model. The graph shows some very interesting
structure with nodes 200, 0 and 300 in the middle (200 was manually pulled out)
and the wheels around 1, 2 and 5. After
some further experimentation with the simplification parameters and a detour
into producing visualizations with Pajek instead of
KOJAK, we get the network in Figure 2:
Figure 2: Catalano network at Day 10
The double wheel patterns shown in Figures 1 and 2 made us suspect that the wheel centers are actually aliases of the
same person. This in turn triggered an
investigation into how to ascertain or refute such alias hypotheses. We added some additional functionality to
find entities with likely duplicates based on their connectivity patterns (a
dual functionality to the outlier computation), and all the double wheel
centers turned out to be candidates. Moreover, we started to suspect that 0,
200 and 300 might all be the same person as well,
however, the similarities there were not as striking.
Again we were looking for ways to refute such alias hypotheses. The simplest constraint is that two suspected
aliases should not call each other which is satisfied
by all the candidates. A more
sophisticated constraint is that they should never be at two different places
at the same time (unless phones are shared by different people). This led us to try to use the cell tower
location information and times between calls to see whether there were two
calls close in time by, for example, 200 and 300, that
were done in two locations too far from each other for the same person. We represented tower location with 16
approximately 5km zones (in PowerLoom) approximately linearizing the island.
Unfortunately, the constraint wasn't satisfied even for a single phone
assuming driving as the fastest way of transportation. For example, about 10% of the calls of 309
were problematic in this respect (maybe they used a helicopter or plane). As a side-effect of this, however, when we
compared the calls of 1 and 309 using a PowerLoom
query, we discovered that they occupied disjoint time ranges which in turn very
much supported the initial alias hypotheses.
The same was true for 200 and 300 (with 300 exclusively calling on days
8-10). The problem was that they didn't
have any connectivity overlap. However,
using the previous stronger alias hypotheses and a little bit of alias rule
modeling in PowerLoom, we could easily see that they are connected to the same
people:
|= (retrieve all
(connected-to p200 ?x) :sort-by :values)
There are 6 solutions:
#1: ?X=p1
#2: ?X=p137
#3: ?X=p2
#4: ?X=p3
#5: ?X=p5
#6: ?X=p97
|= (retrieve all
(connected-to p300 ?x) :sort-by :values)
There are 7 solutions:
#1: ?X=p126
#2: ?X=p268
#3: ?X=p306
#4: ?X=p309
#5: ?X=p332
#6: ?X=p360
#7: ?X=p397
|= (retrieve all (and
(connected-to p200 ?x)
(connected-to p300 ?x)))
No solutions. ;;;
that is no overlap here
|= (assert (and (alias p1
p309)
(alias
p2 p397)
(alias
p3 p360)
(alias
p5 p306)))
(|P|(alias
p1 p309) |P|(alias p2 p397) |P|(alias p3 p360) |P|(alias p5 p306))
|= (retrieve all (and
(connected-to p200 ?x)
(connected-to p300 ?x))
:sort-by
:values)
There are 4 solutions: ;;; now we have
overlap
#1: ?X=p1
#2: ?X=p2
#3: ?X=p3
#4: ?X=p5
That is, after asserting the alias hypotheses, 4 out of 6 friends of 200
overlap with those of 300 and they are all with the suspected core network. Given that and the temporal disjointness led us to assume that they are the same
person.
We then time-sliced the network shown in Figure 2 using its nodes (and
layout) and filling in more and more links as days progressed. As expected, there is a big jump between days
7 and 8 as shown in Figures 3 and 4. All
of the suspected aliases (except 360) appear the first time on day 8. We don't have enough background information
to produce a good explanation for this, except that maybe some external event
forced the inner circle of the Catalano network to change to different phones,
maybe to avoid detection. Another
possible explanation is the complete takeover of the group by different
players, but the alias hypothesis seems more plausible to us. Note that the wheels in Figure 4 are due to
the fact that we fill in with paths of length 2 which now connect the newly
occurring aliases.
Figure 3: Catalano
network time slice for Day 7
Figure 4: Catalano
network time slice for Day 8
At this point we've made up our mind that 5 (and 306) is Estaban based on the frequency of calls to 200 (300) and 1
(and 309) is David Vidro based on his role of
handling much of the communication. We assume 2 and 3 and their aliases are Juan
and Jorge Vidro, but there is no information that
would allow us to distinguish between them, so we made a somewhat arbitrary choice. The final question is the role of 0? Looking again at connectivity, we see that 0
and 200 and 300 are connected to the full inner circle (1,2,3,5)
and no other people are. We suspect that
0 is another alias for Ferdinando, also merging the
two preserves the high-frequency link to Estaban and
gives us a better acount for all the main players,
but it is only a guess.
Finally, assuming we've found the players of the core group, it is time
to find the complete extent of the relevant Catalano network. Now the GroupFinder
comes in handy, since now we have a seed group of 4 to start with. The GroupFinder
produces the following likely additional members in decreasing order of
strength:
145
107 276 31 48 32 34 12 170 179 27 97 137 37 69 8 38
There are more candidates, but we threshold the group based on the number
of connections to the core group (we require at least two). Again, without further information this
decision is somewhat arbitrary. The
resulting network is shown in Figure 5.
Figure 5: The final Catalano/Vidro social network generated by our analysis
References
[1] J. Adibi,
H. Chalupsky, E. Melz and A. Valente.
The KOJAK
Group Finder: Connecting the Dots via Integrated Knowledge-Based and
Statistical Reasoning. In Proceedings
of the Sixteenth Innovative Applications of Artificial Intelligence Conference
(IAAI-04), 2004
[2] S. Lin and H. Chalupsky. Discovering
and Explaining Abnormal Nodes in Semantic Graphs. IEEE Transactions on Knowledge and Data
Engineering. To appear. [draft pre-print]